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Abstract 

We present an easy to use and flexible grid library for developing highly scal- 
able parallel simulations. The distributed cartesian cell-refinable grid (dccrg) sup- 
ports adaptive mesh refinement and allows an arbitrary C++ class to be used as 
cell data. The amount of data in grid cells can vary both in space and time al- 
lowing dccrg to be used in very different types of simulations, for example in fluid 
and particle codes. Dccrg transfers the data between neighboring cells on diff'erent 
processes transparently and asynchronously allowing one to overlap computation 
and communication. This enables excellent scalability at least up to 32 k cores 
in magnetohydrodynamic tests depending on the problem and hardware. In the 
version of dccrg presented here part of the mesh metadata is replicated between 
MPI processes reducing the scalability of adaptive mesh refinement (AMR) to be- 
tween 200 and 600 processes. Dccrg is free software that anyone can use, study 
and modify and is available at https://gitorious.org/dccrg. Users are also kindly 
requested to cite this work when publishing results obtained with dccrg. 
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8 Programming language: C++ 
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17 Nature of problem: 

18 Grid library supporting arbitrary data in grid cells, parallel adaptive mesh refinement, 

19 transparent remote neighbor data updates and load balancing. 

20 Solution method: 

21 The simulation grid is represented by an adjacency list (graph) with vertices stored into a 

22 hash table and edges into contiguous arrays. Message Passing Interface standard is used 

23 for parallelization. Cell data is given as a template parameter when instantiating the grid. 

24 
25 

26 Restrictions: 

27 Logically cartesian grid. 

28 Additional comments: 

29 

30 Running time: 

31 Running time depends on the hardware, problem and the solution method. Small problems 

32 can be solved in under a minute and very large problems can take weeks. The examples 

33 and tests provided with the package take less than about one minute using default options. 

34 In the version of dccrg presented here the speed of adaptive mesh refinement is at most of 

35 the order of 10^ total created cells per second. 
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44 1. Introduction 

45 During the rising phase of the solar cycle, it is becoming more important to 

46 understand the physics of the near-Earth space. The dynamical phenomena caused 

47 by the constant flow of magnetized coUisionless plasma from the Sun creates 

48 space weather that may have harmful effects on space-borne or ground-based tech- 

49 nological systems or on humans in space. While the physics of space weather is 

50 being studied with in situ instruments (e.g. NASA's Radiation Belt Storm Probes 

51 launched in 2012-08-3(Q) and by means of remote sensing, it is also important 

52 to model the near-Earth space with numerical simulations. The simulations can 

53 be used both as a context to the one-dimensional data sets from obsevations, as 

54 well as a source to discover new physical mechanisms behind observed variations. 

55 Present large scale (global) simulations are based on computationally light-weight 

56 simplified descriptions of plasma, such as magnetohydrodynamics (MHD, [fll, 

57 O, [El and flU). On the other hand the complexity and range of spatial scales 

58 (from less than 10' to over 10^ km) in space weather physics signifies the need to 

59 incorporate particle kinetic effects in the modeled equation set in order to better 

60 model, for example, magnetic reconnection, wave-particle interactions, shock ac- 

61 celeration of particles, ring current, radiation belt dynamics and charge exchange 

62 (see e.g. jH for an overview). However, as one goes from MHD towards the full 

63 kinetic description of plasma (from hybrid PIC flU and Vlasov to full PIC (HI, 

64 [|9i), the computational demands increase rapidly, indicating that the latest high 

65 performance computing techniques need to be incorporated in the design of new 

66 simulation architectures. 

67 As the number of cores in the fastest supercomputers increases exponentially 

68 the parallel performance of simulations on distributed memory machines is be- 

69 coming crucial. On the other hand, utilizing a large number of cores efficiently in 

70 parallel is challenging especially in simulations using run-time adaptive mesh re- 

71 finement (AMR). This is largely a data structure and an algorithm problem albeit 



http://www.nasa.gov/mission_pages/rbsp/main/index.html 
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72 specific to massively parallel physical simulations running on distributed memory 

73 machines. 

74 In computer simulations dealing with, for example, continuous matter (a fluid) 

75 the simulated domain is discretized into a set of points or finite volumes which we 

76 will refer to as cells. At any given cell the numerical solution of a differential 

77 equation describing the problem often depends only on data within a (small) part 

78 of the simulated volume. This is true for a single time step in a solver for a 

79 hyperbolic problem or a single iteration in a solver for an elliptic problem. This 

80 spatial data dependency can be implemented implicitly in the solver function(s) 

81 or explicitly as a separate grid library used by the application. 

82 In a simple case the number of cells in the simulation stays constant and the 

83 data dependency of each cell is identical allowing cell data to be stored in an 

84 array whose size is determined at grid creation and the spatial neighbors to be 

85 represented as indices into this array. A straightforward AMR extension of this 

86 concept is to create additional nested grids in specific parts of the simulation do- 

87 main with higher resolution. By solving each grid separately and interpolating the 

88 results from finer grids into coarser grids one does not have to modify the solver 

89 functions. This technique is used extensively for example by Berger (see [lOJ for 

90 some of the earliest work) and by [1 IJ and [12J. In the rest of this work however 

91 we will concentrate on AMR implementations in which additional overlapping 

92 grids are not created but instead cells of the initial grid are refined, i.e. replaced 

93 with multiple smaller cells. 

94 A generic unstructured grid (as provided for example by libMESH lfT3]| ) does 

95 not admit as simple a description as above and is generally described by a directed 

96 graph in which vertices represent simulation cells and directed edges represent the 

97 data dependencies between cells. Unfortunately the nomenclature of graph theory 

98 and geometry overlap to some extent and discussing both topics simulataneously 

99 can lead to confusion. Figure [T] shows the nomenclature we use from this point 

100 forward, the standard graph theoretical terms are given in parentheses for refer- 

101 ence. A cell is a natural unit in simulations using the finite volume method (FVM) 

102 and hereinafter we will use the term cell instead of vertex when discussing graphs. 

103 Also an edge in FVM simulations usually refers to the edges of a cube represent- 

104 ing the physical volume of a cell, and hence we will use the term arrow to refer 

105 to a directed edge in a graph. Furthermore we note that each cell in the grid can 

106 also represent, for example, a block of cells similarly to [3], but for the purposes 

107 of this work the actual data stored in grid cells is largely irrelevant. 

108 Since a graph can also be used to represent the cells and arrows of grids sim- 

109 pier than an unstructured mesh, the question arises how does a particular program 
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Nomenclature 

Geometry Graph theory 




Figure 1 : The nomenclature used in this work for geometry and graph theory. Standard graph 
theoretical terms are given in parentheses for reference. 



110 implement its graph representation of the simulated system, e.g. what simplify- 
in ing assumptions have been made and how is the graph represented in memory. 

112 A popular representation in (M)HD AMR simulations is to have a fixed number 

113 of arrows directed away from each source cell and to store the arrows as native 

114 pointers to the destination cells. In case a cell does not exist all arrows pointing to 

115 it are invalidated in neighboring cells. This technique has been used with diff"erent 

116 variations by fT^l, f[5\, fT6l and fTT], for example. 

117 There are several possibilities for representing the cells and arrows of a graph, 

118 for example an adjacency list or an adjacency matrix [fT8l . In physical simulations 

119 the number of arrows in the graph is usually of the same order as the number of 

120 cells in which case a suitable representation is an adjacency list. In an adjacency 

121 list the cells of the graph are separate objects and each cell stores the arrows 

122 pointing to and/or from that cell. The cells of the graph and the arrows of each 

123 cell can be stored in diff"erent types of data structures. For example the cells are 

124 stored in a contiguous array (representing a linear octree) in |fT9l , [|20]| . ETl and 

125 [[221, a hash table in [23] and a (doubly) linked list in [14]. On the other hand the 

126 arrows of each cell are stored in a fixed size array of native pointers in [,14,1 and as 

127 single bits in fl^. 

128 In this work we introduce the distributed cartesian cell-refinable grid (dccrg) 
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129 for rapid development of parallel simulations using, for example, finite volume or 

130 finite element methods (FEM). In dccrg the graph is represented by an adjacency 

131 list in which cells are stored into a hash table, while the arrow lists of cells are 

132 stored into contiguous arrays. We describe the details of the graph representation 

133 in Section |2] In section |3] we describe the C++ implementation of dccrg and 

134 present its unique features with respect to other published grid codes: arbitrary 

135 data in grid cells, transparent updates of remote neighbor data, user-selectable 

136 neighborhood size for cells and ease of use. In section |4] we test the scalability 

137 of dccrg using a variety tests in one, two and three dimensions and draw our 

138 conclusions in Section|5l 



140 
141 
142 
143 



147 
148 
149 



2. Implementation of the grid graph 



Dccrg represents the grid as an adjacency list in which cells are stored into 
a hash table. A hash table has one clear advantage over a tree when used to 
store the grid: cells can be accessed, inserted and deleted in constant amortized 
time regardless of the number of cells and their physical size and location. Thus 

144 neither the total number of cells nor the number of refinement levels affect the 

145 simulating performance of a single core. Each cell is associated with a unique id 

146 which we use as a key into the hash table. A potential drawback of a hash table is 
the computational cost of the hash function, but according to our tests the cost is 
usually not important. The time to solve one flux between two cells in the MHD 



tests presented in Section 4. 1 is about four times larger than accessing one random 

150 cell in the hash table. The cell access time can be optimized further, for example, 

151 by storing and solving blocks of cells instead of single cells as is done in [|3l and 



152 



discussed further in Section [3.1.1[ 



153 2.1. Mapping cell ids to a physical location 

154 Our cell ids are globally unique integers which offers several advantages: 1) 

155 The cell id can be calculated locally, i.e. without communication with other pro- 

156 cesses, 2) The neighbors of a cell can be stored as cell ids instead of pointers that 

157 are not consistent across computing nodes, 3) The cost of computing the hash 

158 function values is minimized, 4) The memory required for storing the cell ids is 

159 small. 

160 Figure |2] shows an example of mapping cells to unique ids which is done as 

161 follows: cell ids within each refinement level increase monotonically first in x 

162 coordinate, then in y and then in z, with cells at refinement level represented by 

163 numbers from 1 to A^o, refinement level 1 by numbers from No + lioNo + l -<rNi, 
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z Cell ids and indices in the example grid 




Refinement level Ref level 1 Max ref level 



Figure 2: An example dccrg grid of size 1 in each dimension in cells of refinement level 0, with a 
maximum refinement level 2, showing the ids and indices of all possible cells. 



164 etc. Cells at refinement level / + 1 are half the size of cells at refinement level / 

165 in each dimension. Cells at equal refinement level are identical in size. Hence in 

166 three dimensions A'^o = ^ = ^ = ••• = n^ny.n^, where n^, Uy and n, are the grid size 

167 in cells of refinement level in the x, y and z dimensions respectively. Hereinafter 
cell size refers to the logical size of cells assuming a homogeneous and isotropic 
cartesian geometry. 



168 
169 
170 



When searching for the neighbors of a cell in the hash table (see Section 2.2) 

171 it is convenient to use the concept of cell indices: the location of each cell in the 

172 grid is represented by one number per dimension in the interval [0, l^ni - 1 ] where 

173 L is the maximum refinement level of the grid and n, is n^, Uy or respectively. 

174 Figure [2] shows the possible cell indices for an example grid with n,- = Uy = = \ 

175 and L = 2. The size of a single cell of refinement level / is 2^"' indices in each 

176 dimension. A cell spanning more than one index is considered to be located at 

177 indices closest to the origin of the grid, for example cell #3 in Figure |2] is located 

178 at indices (2, 0, 0). Similarly to [24] there is a one-to-one mapping between cell 

179 ids and cell indices plus refinement levels, e.g. in addition to its id a cell can be 

180 uniquely identified by its indices and refinement level. 

181 In the current implementation of dccrg a cell is refined by creating all of its 

182 children; in the example grid of Figure [2]refining cell #1 would create cells #2. ..#9. 

183 In principle this is not required and more complex grid structures are possible in 

184 which, for example, the grid in Figure |2] would consist of cells #1 and #3 alone. 
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185 Such an approach has been found useful by [fT?]| . Complete refinement of cells in 

186 our case was a practical decision based on our current simulation needs and it also 

187 simplifies the neighbor searching code and enables optimizations described in the 

188 next section. 

189 2.2. Neighbor searching 

190 In dccrg all cells existing within a certain minimum distance from local cells 

191 (cells owned by the process) are stored in a hash table with the cell id as the key 

192 and the process owning the cell as the value. Since the mapping of cell ids to a 

193 location is unique, finding the neighbors of a cell in the hash table is straightfor- 

194 ward: for all indices neighboring a given cell the hash table is searched for cells of 

195 all applicable refinement levels. Figure |3] shows an example of neighbor searching 

196 in a grid with = 2,ny = ny = 1,L = 3 and a neighborhood size of one. The 

197 sibhngs of cell #4 (#3, #7, #8, #11, #12, #15 and #16) are not shown for clarity 

198 and some potential neighbors of cell #4 in the positive x direction have also been 

199 omitted. In dccrg the cells' neighborhoods are measured in multiples of their own 

200 size, e.g. for cell #4 all cells that are (at least partially) between indices and 

201 1 1 inclusive in the x direction would be considered as neighbors. In order to find 

202 the neighbor(s) of cell #4 in positive x direction the hash table is searched for the 

203 smallest cell at indices (8, 0, 0) which in this case could correspond to any of the 

204 following cells: #2, #5, #23 or #155. If cell #23 is the smallest cell found in the 

205 hash table the search can stop since it is known that the siblings of cell #23 also 

206 exist (not shown, cells #24, #31, #32, #55, #56, #63 and #64) because all children 

207 of a cell are created when a cell is refined. On the other hand if cell #155 is the 

208 smallest cell found at indices (8, 0, 0) then the search would continue at indices 

209 outside of cell #155 and its siblings, for example at indices (10, 0, 0) or (8, 2, 0). 

210 The concept of hanging nodes or faces used in unstructured mesh codes is not 

211 directly applicable to dccrg because a single cell is the smallest unit that dccrg 

212 deals with. Since the user is responsible for defining what is stored in each cell 

213 he/she must also define, if required, the data stored at the faces, edges and vertices 

214 of cells. Hanging nodes, which are the result of cells of different size sharing an 

215 edge or a face, must be handled by the solvers used for a particular application. For 

216 example in GUMICS-4 [IJ where the magnetic field is separated into background 

217 and perturbed components the face average background magnetic field is required 

218 when solving the flux through a face. With AMR a face average value of the 

219 background field is required for every face of every cell because, for example, if 

220 only one face average is stored per dimension in cells it would not be possible to 

221 solve the flux between cells (#4, #2) and (#1, #5) in Figure |3| This is due to the 
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Figure 3: An example grid illustrating neighbor searching for cell #4 in the positive x direction. 
The siblings of cell #4 are not shown for clarity and also some of its potential neighbors in positive 
X direction have been left out. 



222 fact that the face average value of the smaller cell must be used in both cases and 

223 it is only available if every face of every cell stores the face average field. 

224 Currently dccrg enforces a maximum refinement level difference of 1 between 

225 neighboring cells. Hence it is sufficient to search for cells of three refinement 

226 levels I - + 1 when finding the neighbors of a cell of refinement level /. In 

227 principle the enforcement of maximum refinement level difference is not required. 

228 For example in Figure |3]cell #4 (refinement level 1) has cell #155 (refinement level 

229 3) as a neighbor but in the current version of dccrg such a situation is not permitted 

230 and searching the hash table for cells #2, #5 and #23 is sufficient for finding the 

231 neighbors of cell #4. This was a practical decision based on our experience with 

232 global MHD modeling of the Earth's magnetosphere using GUMICS-4. In future 

233 this restriction might be removed. A similar one is also used in [|24ll . 

234 Even though a maximum refinement level difference of one is enforced be- 

235 tween neighboring cells and searching for cells in the hash table is a quick op- 

236 eration, in practice the ids of neighbors of local cells are also stored explicitly. 

237 As mentioned in Section [T] this neighbor information corresponds to arrows be- 

238 tween cells in a graph and hence we will use the term arrow list to refer to the 

239 neighbor list of a cell. Dccrg stores both the arrows pointing away from and the 
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240 arrows pointing to local cells. With AMR it is possible that there exists only one 

241 arrow between two cells because cells' neighborhoods are measured in units of 

242 their own size. For example in Figure [2] with a neighborhood size of 1, cell #13 

243 would be considered a neighbor of cell #2 but cell #2 would not be considered a 

244 neighbor of cell #13. Explicitly storing the arrows to and from local cells enables 

245 fast iteration for example by user code (solvers, reconstruction functions, etc). In 

246 dccrg the arrow lists of local cells are stored as contiguous arrays. 

247 2.3. Algorithmic advantages 

248 The most important advantage that globally unique cell ids have over a tradi- 

249 tional graph implementation using native pointers between cells is that the arrows 

250 between cells are not required to be consistent during AMR or load balancing. 

251 For example when doing AMR, a pointer-based implementation has to be careful 

252 not to leave any dangling pointers and to update the pointers of all nearby cells 

253 in the correct order so as not to lose access to any cell. On the other hand with 

254 unique cell ids the arrow lists of cells can simply be emptied when needed and 



255 new neighbors searched in the hash table as described in Section 2.2 In addition 

256 to being easy to implement this method also admits simple thread-based paral- 

257 lelization inside dccrg due to different threads modifying only the arrow lists of 

258 different cells. 

259 It is also advantageous to use unique cell ids in arrow lists instead of pointers in 

260 a parallel program running on distributed memory hardware. In this environment 

261 a pointer to a neighboring cell is only valid on one process and cannot be used to 

262 refer to the same neighbor on other processes. On the other hand the same unique 

263 cell id can be used by all processes to refer to the same neighbor regardless of its 

264 actual location in memory. 

265 3. Implementation 

266 A separate grid library is a natural abstraction probably for any physical sim- 

267 ulation but especially for simulations using FVM where the concept of a grid and 

268 its cells' data dependencies are easy to define and implement. Thus following 

269 good software development practice dccrg is implemented independently of any 

270 specific physical problem or its solver, while still providing the flexibility required 

271 for various types of simulations, for example (M)HD, advection (e.g. Vlasov) and 

272 kinetic. 

273 Dccrg is written in C-I--I- which allows us to easily separate low level function- 

274 ality of dccrg into subclasses which higher-level classes can use thus also bene- 
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280 
281 



275 fiting from a modular internal implementation, a technique also used in [fT31 . For 

276 example the physical geometry of the grid is handled by a separate class which is 

277 also given to dccrg as a template parameter. This allows one to easily extend the 

278 grid geometries supported by dccrg. In the default homogeneous and cartesian ge- 
ometry cells of the same refinement level are identical in size in each dimensior|^ 

Here we describe the unique user-visible features of dccrg with respect to other 
grid codes and also present important features of the serial and parallel implemen- 

282 tation. 

283 3.1. Unique features 

284 3.1.1. Arbitrary data in grid cells 

285 The most important feature distinguishing dccrg from other grid codes is the 

286 possibility of trivially storing data of arbitrary type and size in the grid's cells 

287 by simply giving the class which is stored in the cells as a template parameter to 

288 dccrg when creating an instance of the grid. This also allows a single simulation 

289 to have several independent parallel grids with different geometries and different 

290 types of data stored in the grids' cells' . The amount of data can also vary between 

291 different cells of the same grid and in the same cell as a function of time. This is 

292 required for example in kinetic simulations where not only does the total number 

293 of particles change but also the number of particles in each grid cell varies. In 

294 dccrg this is handled by each instance of the user's cell data class providing a MPI 

295 datatype corresponding to the data to be sent from or received by that particular 



304 



296 cell. An example of this is presented in Section 3. 1 .4 

297 Completely arbitrary cell data can also be transferred between processes if the 

298 cell data class provides a serialize function which the MPI bindings of boost li- 

299 brary will use for transferring cell data between processed Although this method 

300 of transferring data between processes the most general it is also the slowest since 

301 data is first copied into a contiguous buffer by serialization and subsequently trans- 

302 ferred by MPI resulting in at least one additional copy the data being created com- 

303 pared to pure a MPI transfer. This is also the case in the SAMRAI framework [.25ll 



which supports transferring arbitrary patch data using the same technique. 



305 3.1.2. Automatic remote neighbor updates 

306 Dccrg can automatically transfer cell data between processes both for remote 

307 neighbor data updates and load balancing using simple function calls. Further- 



^https://gitorious.org/dccrg/dccrg/blobs/master/dccrg_constant_geometry.hg2 

■'User-defined data types in http://www.boost.org/doc/libs/l_49_0/doc/html/mpi/tutorial.htinl 
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308 more whenever cell data is sent between processes either one MPI message per 

309 cell can be used or, similarly to f25\ , all cells being sent to another process can 

310 be transferred using a single MPI message. Updating the remote neighbor data 

311 between processes is possible using several methods. The simplest one is the 

312 synchronous update function that updates the remote neighbor data between pro- 

313 cesses and returns once transfers have completed (see Section [3. 1.4 1. The most 



314 



fine-grained communication currently supported can be used by calling a separate 

315 function for initiating transfers and functions that wait for the sends and receives 

316 to complete respectively. A typical usage scenario would consist of the following: 

317 1. Start transferring remote neighbor data 

318 2. Solve the inner cells of the simulation (cells without remote neighbors) 

319 3. Wait for the data from other processes to arrive 

320 4. Solve the outer cells of the simulation (cells with at least one neighbor on 

321 another process) 

322 5. Wait for the data from this process to be sent 



323 The MHD scalability tests we present in section 4.1 use this procedure with the 

324 exception that step 5 is executed before step 4 due to the technical implementation 

325 of the GUMICS-4 MHD solver. 

326 3.1.3. User-selectable neighborhood size 

327 As mentioned in Section [T] the size of cells' neighborhood in simulations is 

328 highly problem / solver dependent. Specifically the problem / solver used in the 

329 simulation dictates the distance from which data is required at a cell in order to 

330 advance the simulation for one time step or one iteration in that cell. In many 

331 previously published grid codes the size of cells' neighborhood is restricted to 1 

332 either explicitly or implicitly. For example in [E^ in three dimensions a block (a 

333 single cell from the point of view of the grid) has 6 neighbors and it is assumed 

334 that a block consist of such a number of simulation cells that, for example, a solver 

335 needing data from a distance of 3 simulation cells can obtain that data from the 

336 neighboring block. Other examples are [|2B, [[II, O, ill, lllll and [[I71|- Dc- 

337 erg supports an arbitrarily large neighborhood chosen by the user when the grid is 

338 initialized. The size of the neighborhood can be any unsigned integer and all other 

339 cells within that distance of a cell (in units of size of the cell itself) will be consid- 

340 ered as a neighbors of the cell. This enables the use of high-order solvers with the 

341 added possibility of refining each neighboring cell individually. Naturally one can 

342 also store a sufficiently large block of simulation cells in one dccrg cell allowing 

343 one to use a small 6 cell neighborhood as done in [|24|. Zero neighborhood size is 
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344 a special case in dccrg signifying that only face neighbors of equal size are con- 

345 sidered neighbors (with AMR all of the refined neighbor's children are considered 

346 instead). For example in a periodic grid without AMR neighborhood sizes of 1 

347 and 2 would result in 26 and 124 neighbors per cell respectively. 

348 Naturally the size of the neighborhood affects the amount of data that must be 

349 transferred between processes during remote neighbor data updates regardless of 

350 the grid implementation thus affecting parallel scalability. Additionally since in 

351 dccrg a maximum refinement level difference of one is enforced between neigh- 

352 boring cells the size of the neighborhood does affect for example the amount of 



353 



induced cell refinement (see Section 3.3 1. 



354 3.1.4. Ease of use 

355 Even though initially dccrg was developed only for in-house use, it was nev- 

356 ertheless designed to be simple to use for the kinds of simulations it is targeted 

357 for. Figure |4] shows an example of a complete parallel program playing Conway's 

358 Game of Life using dccrg written in less that 60 lines of code (LOC) including 

359 whitespace and comments. Lines 10... 14 of the program define the class to be 

360 stored in every cell of dccrg along with member functions at and mpi_datatype 

361 which dccrg calls when querying the information required for transferring cell 

362 data between processes. The current state of a cell is saved into data [0] and the 

363 number of its live neighbors is saved into data[l] . On line 23 an instance of the 

364 grid is created with the class defined above as cell data. On line 24 the geometry 

365 of the grid is set to 10x10x1 cells at refinement level with minimum coordinate 

366 at (0, 0, 0) and cells of size 1 in each dimension. On line 25 the grid is initialized 

367 by setting the load balancing function to use in Zoltan, the neighborhood size and 

368 the maximum refinement level of cells. Lines 26 and 27 balance the load using 

369 Zoltan and collect the local cells. In this example the load is balanced only once 

370 and the local cell list does not change afterwards. On line 46 the non-existing 

371 neighbors of local cells are skipped. This is because the grid is initialized with 

372 non-periodic boundaries and neighbors that would be outside of the grid do not 

373 exist. 

374 Figure |5] shows relevant excerpts from a simple kinetic simulation showing the 

375 use of dccrg in the case of variable amount of cell data, the full program can be 

376 viewed in the dccrg git repositor)]^ The remote neighbor update logic in the main 

377 simulation loop consists of the following steps: 



'*https;//gitorious.org/dccrg/dccrg/blobs/master/tests/particles/simple.cpp 
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378 1 . The total number of particles in each cell is transferred between processes 

379 (lines 43 and 44) 

380 2. Space for receiving particle data is allocated in local copies of remote cells 

381 based on their received total number of particles in step 1 (lines 47. ..52) 

382 3. The particle coordinates are transferred between processes (lines 55 and 56) 

383 The cell data class of the example kinetic simulation must provide the correct in- 

384 formation to dccrg when updating remote neighbor data: The at and mpi_datatype 

385 functions now return a different address and number of bytes respectively depend- 

386 ing on whether the total number of particles or the particle coordinates are trans- 

387 ferred between processes. This is decided by the user in the main simulation loop. 

388 Additionally the resize function of the cell data class allocates memory for as 

389 many particles as is there are in number_of ^particles. A similar approach to 

390 the one described above is also used in our Vlasov simulation (further developed 

391 from [|71|) where each real space cell has a separate adaptable velocity grid for ions 

392 consisting of a variable number of 4^ cell velocity blocks. 

393 In the previous example two communications are required per time step be- 

394 cause processes receiving particle data do not know the number of incoming par- 

395 tides in advance. Since the MPI standard requires that the maximum amount of 

396 data to be received is known before calling the receive function the number of 

397 particles has to be communicated separately. This guarantees that processes re- 

398 ceiving particles can specify the size of the data to MPI and allocate the memory 

399 required for received particles. 

400 3.2. Load balancing / cell partitioning 

401 Load balancing is also accomplished easily with dccrg. A user can call the 

402 balance_load function to let the Zoltan (TP\ library create a new partition, and 

403 single cells can also be moved manually between processes using the pin and 

404 unpin functions. Most of Zoltan's load balancing method^can be used, namely: 

405 NONE, RANDOM, BLOCK, RGB, RIB, HSFC, GRAPH, HYPERGRAPH and 

406 HIER. In any case dccrg will transparently execute the new partition by transfer- 

407 ring the necessary cell data between processes using MPI. 

408 The structure of the grid in dccrg includes the owner of a cell in addition to the 

409 unique id of the cell (id is the key and owner is the value in a hash table). Thus the 

410 cell ids themselves are not used for partitioning cells between processes and any 



^http://www.cs.sandia.gov/Zoltan/ug_html/ug_alg.html#LB_METHOD 
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411 cell can be moved to any process (for example by using the RANDOM partitioner 

412 of Zoltan which we have found to be very useful for testing). 

413 3.3. Adaptive mesh refinement algorithm 

414 Due to the unique mapping of cells' ids and their physical location and size 

415 it is straightforward to refine any given cell in the grid, i.e. to calculate the ids 

416 of the children of any cell, and can be done locally (see Section [2]). In order 

417 to enforce a maximum refinement level difference of one between neighboring 

418 cells whenever a cell is refined the refinement level of all neighbors is checked. 

419 If the refinement level of any neighbor is less than that of the cell being refined 

420 that neighbor is also refined. This is continued recursively until no more cells 

421 need to be refined. The size of the cells' neighborhood affects induced refinement 

422 indirectly by changing the number of neighbors a cell has and hence the potential 

423 number of cells whose refinement will be induced. In dccrg a few simplifications 

424 have been made to AMR: 1) Any set of cells can only be refined once before 

425 calculating induced refinement (by stop_ref ining), e.g. induced refinement can 

426 only increase the refinement level of cells by one, and 2) Unrefining a cell does not 

427 induce unrefinement, e.g. any cell which has at least one neighbor with refinement 

428 level larger than the cell being unrefined (i.e. it has a smaller neighbor) cannot be 

429 unrefined. 

430 In a parallel setting the only difference to the above is that whenever a process 

431 refines or unrefines a cell that information has to be given to all processes which 

432 have cells within a certain distance of the cell that was refined or unrefined. Cur- 

433 rently this distance is equal to infinity, e.g. all changes to the structure of the grid 

434 (refines and unrefines) are communicated globally. This has a significant impact 

435 on the parallel scalability of AMR in dccrg and is discussed in Section |4] and a 

436 method for making this minimum distance finite is discussed in Section |5j The 

437 changes in grid structure are exchanged between processes after each recursive 

438 step of induced refinement which are continued until no more cells need to be 

439 refined. 

440 3.4. Parallel implementation 

441 Good scalability in distributed memory machines requires asynchronous point- 

442 to-point MPI communication between processes with minimal total amount of 

443 communication, and especially minimal amount of global communication. The 

444 number of MPI messages should also be minimized in order not to burden the net- 

445 work with unnecessary traffic. Therefore any process must know which processes 
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446 require data from local cells and from which remote cells data is required dur- 

447 ing remote neighbor updates without querying that information from other pro- 

448 cesses beforehand. Thus in dccrg every process knows the structure of the grid 

449 (e.g. which cells exist and which processes own them) to at least a certain dis- 

450 tance from any of its cells so it can calculate locally which cells' data to send and 

451 receive from other processes. Due to this dccrg does not require global commu- 

452 nication during ordinary time stepping, e.g. remote neighbor data updates, which 

453 enables excellent scalability when not doing load balancing or AMR as shown 

454 in section 4.1 Internally dccrg precalculates these send and receive lists when- 

455 ever the structure of the grid changes due to AMR or cells being moved between 

456 processes. 

457 Even though the replicated mesh metadata of dccrg does not include the data 

458 stored in each cell it nevertheless limits the size of dccrg grids in practice to less 

459 than 100 M existing cells. This number does not depend on the refinement level 

460 or physical location of the cells and has so far been more than sufficient for our 

461 needs. To our knowledge the only parallel grid library that does not have any 

462 persistent global data structures is [|20ll . In [[2T| and [|22| only the macro structure 

463 of the grid (i.e. cells of refinement level 0) is known by all processes. According 

464 to the authors this limits the size of the grid to the order of 10^. . . 10^ cells of 

465 refinement level but does not limit the number of smaller cells. 

466 4. Scalability results 

467 The time stepping scalability of dccrg (e.g. without AMR or load balancing) 

468 depends mostly on the hardware running the simulation and on three parameters 

469 specific to each simulated system: 1) the time required to solve the inner cells of a 

470 process, 2) the amount of data transferred during the remote neighbor data update 

471 of a process and 3) the time required to solve the outer cells of a process. We 

472 show the dependency of dccrg scalability on these parameters by varying the total 

473 number of cells and processes and by using a MHD solver in one, two and three 

474 dimensions. The run-time AMR scalability of dccrg is also presented. 

475 The non-AMR scalability tests were carried out on three different supercom- 

476 puters: 1) A 2 k core Cray XT5m system with 12 core nodes connected by 

477 SeaStar2 installed at the Finnish Meteorological Institute which we will refer to 

478 as Meteo, 2) A 295 k core IBM Blue Gene/P system with 128 core nodes (node- 

479 boards) installed at the Jiilich Supercomputing Centre which we will refer to as 

480 Jugene, and 3) A 12 k core buUx system with 32 core nodes connected by In- 

481 finiBand installed at the Tres Grand Centre de Calcul which we will refer to as 
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482 



Curie. 



483 4.1. MHD tests without AMR 

484 First we show the time stepping scalability of dccrg in several MHD problems 

485 with a solver developed for the global MHD model GUMICS-4 ([HI) which solves 

486 ideal MHD equations in conservative form. Specifically the solver is a first order 

487 Roe's approximate Riemann solver for a Godunov type problem [26]. In the test 

488 results we present here only the Roe solver from GUMICS-4 is used as we do not 

489 experience problems with negative pressures or densities in the presented tests. 

490 The nature of the solver is such that when solving the flux through a face data 

491 is only required from cells adjacent to the face, irrespective of the size of cells 

492 involved. Thus interpolation of data is not required at any point in the solution 

493 and if a cell has more than one face neighbor in any direction the flux through 

494 each common face is solved in the usual way. In the tests presented here we do 

495 not include a background magnetic field which is used in GUMICS-4 to represent 

496 the Earth's dipole field. 

497 The problems we use are the one-dimensional shock tube presented for exam- 

498 pie in [i27i . the two-dimensional circularly polarized Alfven wave presented by 

499 fl^ and further elaborated on by [29|, and the three-dimensional blast wave pre- 

500 sented by OOll . Periodic boundary conditions are used in all tests, except for the 

501 shock tube test in the direction of the tube where initial conditions are enforced 

502 after every time step. In MHD tests every cell contains the cell-averaged values of 

503 the conservative MHD variables (density, momentum density, total energy density 

504 and magnetic field) giving a total of 128 bytes which must be transferred when up- 

505 dating the data of one cell between two processes. Since only the face neighbors 

506 of a cell are required for calculating the next time step we use a neighborhood 

507 size of zero in dccrg. In these tests processes execute one collective MPI com- 

508 munication per time step in order to dynamically calculate the maximum physical 

509 length of the time step. No other global communication is done. Since the grid is 

510 static in these tests the computational load is balanced only once at the start of the 

511 simulation by using a Hilbert space-filling curve[^ instead of Zoltan. 

512 Figure |6] shows the results of strong scalability tests using MHD with a static 

513 grid in Meteo in one, two and three dimensions. The total number of cells (10 k, 

514 50 k, 0.1 M, 0.5 M and 1 M) is kept constant while the number of MPI processes 

515 is increased from 12 to 1536. In each test case scalability improves with the total 



^https://gitorious.org/sfc/sfc/blobs/master/sfc++.hpp 



17 



516 number of cells because processes have more inner cells to solve while remote 

517 neighbor data is being transferred. For example in the shock tube test every pro- 

518 cess requires the data of two remote neighbors at most while the number of inner 

519 cells with 1.5 k processes increases from about 4 (10 k total cells) to 649 (1 M 

520 total cells). With 1 M cells the one and two dimensional tests scale almost ideally 

521 in Meteo and the three dimensional test is also quite close to ideal. The overall 

522 decrease in scalability with increasing number of dimensions is due to more data 

523 being transferred between processes for the same number of local cells. 

524 As suggested by the scalability results above most of the simulation time is 

525 spent solving MHD which is shown in Figure [7] for the three dimensional blast 

526 wave test using 1 M cells. The only global communication executed per time step 

527 while simulating is the calculation of the maximum length of the physical time 

528 step and is obtained using MPI_Allreduce. This is labeled as AUreduce in Figure 

529 |7] and basically shows the computational and MPI imbalance between processes 

530 due to load balancing. Initialization and file I/O are not included in the profile and 

531 other parts of the simulation take an insignificant fraction of the total run time. 

532 The non-AMR scalability tests were also carried out in Jugene and Curie and 

533 the results for the three-dimensional blast wave are shown in Figures [8] and |9] 

534 respectively. Similarly to Meteo the one and two dimensional tests (not shown) 

535 scale better than the three-dimensional test in both Jugene and Curie. The overall 

536 results are similar in all tested machines, e.g. scalability improves with increasing 

537 number of total cells and decreasing number of dimensions. In Jugene very good 

538 scalability up to 32 k processes is obtained for a total number of cells of 1 M and 

539 above. The total simulation speed in Jugene is only slightly above that of Meteo 

540 mostly due to the relatively small single-core performance of Jugene. Additionally 

541 the average number of cells per process is more than 20 times larger in Meteo than 

542 in Jugene for the maximum number of processes used but this has only a small 

543 effect on scalability in Jugene. 

544 In Curie good scalability up to 8 k processes is obtained only with 64 M total 

545 cells but with a maximum solution speed of nearly 400 M solved cells per second 

546 which is over twice of that in Jugene. We attribute this to the relatively low node 

547 interconnect and high single-core and performance of Curie respectively when 

548 compared to Jugene. 

549 4.2. Scalability of run-time AMR 



Figure 10 shows the speed of pure adaptive mesh refinement in dccrg. Initially 
the grid is 8^, 16^, 64^ or 128^ cells and every process refines all local cells until 
the total size of the grid is 128^ or 256^ cells. Initially the cells were partitioned 
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553 using a space-filling curve and this is not included in the timings. Cells also were 

554 not transferred between processes during AMR. As can be seen in Figure [10] the 

555 maximum cell refining speed of dccrg is of the order of 1 M cells per second. The 

556 linear scalability of AMR up to some 32 MPI processes is explained by the fact 

557 that after changes in the structure of the grid the arrow lists are recalculated only 

558 for local cells and MPI communication has not yet become a bottleneck. At 256 

559 processes the amount of global communication required for updating the structure 

560 of the grid between all processes starts to significantly aff"ect the speed of AMR. 

561 This is discussed further in Section |431 

562 4.3. Scalability of blast wave test with AMR 

Here we present the scalability of dccrg with AMR in the three-dimensional 



blast wave test used in Section 4.1 In this test a procedure similar to the one in 
GUMICS-4 (eq. 2 in [1 J) is used to decide whether to refine or unrefine a cell: 
A refinement index is calculated for each cell based on the relative difference of 
several variables between a cell and its face neighbors. Here the calculation of 
refinement index a additionally includes velocity shear relative to the maximum 
wave velocity from the cells' interface. The full equation for the refinement index 

Ap ^U, (Apf (ABif |A5i| (Av)2 

or = max( — , ^=:z^, :=r, ) 

563 where A denotes the difference in a variable between two cells, the hat denotes a 

564 minimum of the two values (as it actually does also in [IJ), v,,,,-,, = v^-l-(0.01 -Vwavef 

565 and Vwave is the maximum wave velocity from the cells' interface. In this test a cell 

566 is refined if a > 0.02 ■ (Z -I- 1)/L, where / is the cell's current refinement level and 

567 L = 4 is the maximum refinement level of the grid. In other words a cell is 

568 refined to the maximum refinement level if its refinement index exceeds 0.02. A 

569 cell is unrefined if or < 0.02 • (/ -I- l)/L/2, e.g. a cell is kept at refinement level 

570 if a < 0.0025 and none of the cell's neighbors' refinement levels exceed 1 

571 (due to dccrg enforcing a maximum refinement level difference of one between 

572 neighbors). 

573 We use a maximum refinement level of 4 in this test with an initial grid of 

574 25^ cells which results in an effective resolution of 400^ = 64 M cells. The com- 

575 putational load is balanced using Zoltan's recursive coordinate bisection (RGB) 

576 algorithm whenever the fraction of local cells (/^ = Nmax/N,m„, where max and 

577 min are the maximum and minimum number of local cells among all processes 

578 respectively) exceeded a specified limit. Animation 1 (Figure [TT] in the print ver- 

579 sion) shows from left to right, top to bottom the grid, pressure, magnetic and 
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580 kinetic energy density during the simulation (at the end of the simulation in print 

581 version) in the y = plane when grid is adapted at every time step. At the end 

582 of the simulation the fraction of maximum to minimum values are 15 for density 

583 (not shown), 43 for pressure and 2.3 for magnetic energy density. Even though 

584 the MHD solver we use is simpler than the one in OOl the results are still close 

585 due to the high effective resolution achieved by using run-time AMR. 



601 
602 
603 
604 
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586 Figure 12 shows the total solution speed during the simulations as a function of 

587 the number of MPI processes used. In the reference run a CFL [|34i of 0.4 is used, 

588 AMR is done at every time step and the load is balanced whenever the local cell 

589 fraction > 2. The AMR^? runs are otherwise identical to the reference run but 

590 CFL is set to 0.4/N and AMR is done every Nth time step, essentially multiplying 

591 the amount of non-AMR work in these runs by N. The results between different 

592 AMR runs are identical by visual inspection except for increased diffusion in runs 

593 with low CFL. The ratio of work required by AMR and the rest of the simulation 

594 has a significant effect on the total solution speed. The solution speed is a factor 

595 of 5 higher in the AMR32 run than in the reference run when using about 500 MPI 

596 processes. In the reference AMR run the total solution speed is close to 1/10 of 

597 the non-AMR version with up to 144 processes and in the AMR32 the speed is 

598 close to 1 /3 with up to 288 processes after which both fractions start to decrease. 

599 We define these as the regions of excellent AMR scalability. On the other hand 

600 in all of the AMR runs the total solution speed increases up to about 500 to 600 
processes after which it starts to decrease. We define this as the region where 
AMR is scalable. The total number of cells in the AMR runs averages to 4.5 
M which is about 1/14 of the non-AMR version. Consequently in the region of 
excellent scalability the time to solution when using AMR is about 67 % to 22 % 

605 of the non-AMR time for the reference and AMR32 runs respectively. Even with a 

606 higher number of MPI processes it can still be advantagous to use AMR because 

607 the number of simulation cells is over a magnitude lower than without AMR. At 

608 the end of the AMR runs 9.9 M cells exist in the grid and the total number of 

609 cells created and removed is between 40.2 M and 40.7 M, depending mostly on 
the diffusion, and averages to about 91k added -I- removed cells per time step. 



611 Figure 13 shows which parts of the AMR blast wave test require the most 

612 time. As the number of processes is increased the largest fraction of simulation 

613 time is spend in global communication related to AMR and load balancing. The 

614 AUreduce label again indicates global calculation of the physical time step and the 

615 Load balancing label indicates the simulation time spent in load balancing related 

616 functions. At about 300 MPI processes and above the largest fraction of simu- 

617 lation time is spent communicating changes in the structure of the grid between 
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618 all processes. This includes both refining and unrefining cells as well as load bal- 

619 ancing and in each case the MPI_Allgatherv function is used for distributing the 

620 changes in grid structure between all processes. Only a small fraction of the time 

621 spent in load balancing related functions is taken by Zoltan. 

622 5. Discussion 

623 While the ease of use of a software library is subjective it can be quantified 

624 by the number of lines of code required for usage and compared against other 

625 libraries when using the same programming language. With dccrg a complete 

626 parallel program playing Conway's game of life can be implemented in less than 

627 60 LOC including whitespace and comments. Even though the required LOC 

628 is a crude estimate for a software library's ease of use it is nevertheless telling 

629 that such a short parallel program does not seem to be possible with other grid 

630 libraries. 

631 The flexibihty of dccrg also stands out since as far as we know only [|25l al- 

632 lows one to easily exchange arbitrary cell data between MPI processes. Addition- 

633 ally dccrg supports transferring user-defined MPI datatypes which is critical for 

634 performance in some applications. For example when solving the 6 dimensional 

635 Vlasov equation in the Earth's magnetosphere ([7J) the simulation is heavily mem- 

636 ory bound and using MPI datatypes directly for exchanging remote neighbor data 

637 is significantly faster than serializing said data into an additional bufi'er(s) before 

638 transfer. Dccrg also provides automatic neighbor data updates between processes 

639 with the ability of easily overlapping computation with communication. 

640 Currently the largest drawback of dccrg is the fact that the entire structure of 

641 the grid is known by every process, i.e. a part of the mesh metadata is replicated 

642 by all processes. The global data structure prevents grids larger than about 100 M 

643 cells but this has not been a problem for us and can be worked around by storing 

644 blocks instead of single cells into dccrg (similarly to [fT4|. for example). The 

645 global grid data structure of dccrg also reduces the scalability of AMR in the worst 

646 case to about 200 and overall to about 600 MPI processes. Nevertheless using 

647 AMR can lead to significant savings in the required memory as the number of cells 

648 can be one or even two orders of magnitude lower. Depending on the problem the 

649 required CPU time can also be significantly reduced when using AMR especially 

650 when the number of MPI processes used is of the order of 300 or less. It should 

651 also be noted that using threads to parallelize solvers within a shared memory 

652 node could effectively multiply the scalability range of simulations by the number 

653 of cores within one node, but this was not investigated. 
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654 Removing or significantly reducing the global data structure (as done in [EOl . 

655 [|2T| and (221) should improve both the largest attainable grid size and scalability 

656 of AMR considerably. Intuitively this is straightforward since with the exception 

657 of load balancing every process only needs to know the structure of the grid up to 

658 some finite distance from local cells. In order to be able to arbitrarily refine and 

659 unrefine grid cells without global communication local changes in the structure 

660 of the grid must be communicated between all neighboring processes. A neigh- 

661 boring process is defined as any process that has one or more of its cells inside 

662 the neighborhood of any cell of refinement level that overlaps a local cell. In 

663 other words if only level cells exist in the grid then the owners of all remote 

664 neighbors of local cells are considered as neighboring processes; and this holds 

665 no matter how the grid is subsequently refined and unrefined assuming that cells 

666 are not transferred between processes (load balancing). Global communication 

667 can also be avoided during load balancing if, for example, cells can be transferred 

668 only between neighboring processes. Even in this case new neighbor processes 

669 have to be recalculated but global communication is not required because cells 

670 could only have been transferred to/from a subset of all processes. Implementing 

671 this completely distributed mesh metadata is left to a subsequent study. 

672 We presented the distributed cartesian cell-refinable grid (dccrg): an easy to 

673 use parallel structured grid library supporting adaptive mesh refinement and ar- 

674 bitrary C++ classe as cell data. Various MHD scalability results were presented 

675 and depending on the problem, hardware and whether AMR is used excellent to 

676 average scalability is achieved. Dccrg is freely available for anyone to use, study 

677 and modify under version 3 of the GNU Lesser General Public License and can 

678 be downloaded from https://gitorious.org/dccrg. 
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1 #include "boost/f oreach . hpp" 

2 #include "boost/mpi . hpp" 

3 #include "cstdlib" 

4 #include "vector" 

5 #include "zoltan.h" 

6 #define DCCRG CELL DATA SIZE FROM USER 

7 #define DCCRG USER MPI_DATA_TYPE 

8 #include "dccrg.hpp" 
9 

10 struct game_of_lif ecell { 



11 int data[2] ; 

12 void* at() {return &(this->data [0] ) ; } 

13 MPI_Datatype mpi_datatype( ) {return MPI_INT;} 

14 }; 

15 

16 int main(int argc, char* argv[]) { 

17 boost :: mpi :: environment env(argc, argv); 

18 boost: : mpi: : communicator comm; 
19 

20 float zoltan version; 

21 ZoltanInitialize(argc, argv, &zoltan_version) ; 
22 

23 dccrg: :Dccrg<gameof_life_cell> grid; 

24 grid.set geometrydO, 10, 1, 0, 0, 0, 1.0, 1.0, 1.0); 

25 grid.initialize(comm, "HYPERGRAPH" , 1, 0); 

26 grid . balance load () ; 

27 const std : : vector<uint64_t> cells = grid . getcells ( ) ; 
28 

29 // initial state 

30 BOOST FOREACH( const uint64_t cell, cells) { 

31 gameoflif ecell* celldata = grid[cell]; 

32 cell data->data[0] = cell data->data[l] = 0; 

33 if (cell % 4 == 0) cell data->data[0] = 1; 

34 } 

35 

36 for (int turn = 0; turn < 10; turn++) { 

37 grid . update remote neighbor data ( ) ; 
38 

39 // collect live neighbor counts 

40 BOOST_FOREACH( const uint64 t cell, cells) { 

41 gameoflif ecell* celldata = grid[cell]; 

42 cell_data->data[l] = 0; 
43 

44 const std: :vector<uint64_t>* neighbors = grid.get_neighbors(cell) ; 

45 BOOST_FOREACH( const uint64 t neighbor, *neighbors) { 

46 if (neighbor == dccrg : :error cell) continue; 

47 gameoflif ecell* neighbordata = grid[neighbor] ; 

48 if (neighbor_data->data[0] == 1) cell_data->data[l]++; 

49 } 

50 } 

51 // assign new state 

52 BOOST_FOREACH( const uint64 t cell, cells) { 

53 gameoflif ecell* celldata = grid[cell]; 

54 if (cell_data->data[l] == 3) cell_data->data[0] = 1; 

55 else if (cell data->data[l] != 2) cell data->data[0] = 0; 

56 } 

57 } 

58 return 0; 

59 } 
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Figure 4: A complete parallel program playing Conway's Game of Life using dccrg, see the text 
for details. 



1 struct Cell { 

2 unsigned int number of particles ; 

3 std: :vector<boost: :array<double, 3> > particles; 

4 static bool transfer particles; 
5 

6 // returns the starting address of data to send/receive 

7 void* at() { 

8 if (Cell: :transfer_particles) { 

9 if (this->particles . size( ) > 0) { 

10 return &(this->particles [0] ) ; 

11 } else { 

12 // return a sane address 

13 return &(this->number_of_particles) ; 

14 } 

15 } else { 

16 return Si(this->number of particles) ; 

17 } 

18 } 

19 

20 // returns the length in bytes to send/receive 

21 MPIDatatype mpi datatype! ) const { 

22 MPI_Datatype datatype; 

23 if (Cell: :transfer particles) { 

24 MPI_Typecontiguous ( 

25 this->particles . size( ) * sizeof (boost : :array<double, 3>), 

26 MPIBYTE, &datatype) ; 

27 } else { 

28 MPI_Type contiguous (1, MPIUNSIGNED, &datatype); 

29 } 

30 return datatype; 

31 } 
32 

33 void resizeO { 

34 this->particles . resize(this->number_of_particles) ; 

35 } 

36 }; // Struct Cell 
37 

38 int main (...) { 

39 dccrg: :Dccrg<Cell> grid; 
40 

41 for (int Step = 0; step < maxsteps; step++) { 

42 // update number of particles between processes 

43 Cell: :transfer particles = false; 

44 grid . updateremoteneighbordata ( ) ; 

45 

46 // allocate space for particles in copies of remote neighbors 

47 const boost :: unordered_set<uint64_t>* remoteneighbors 

48 = grid . getremotecellswithlocalneighbors ( ) ; 

49 BOOST FOREACH( const uint64 t remote neighbor, *remote_neighbors) { 

50 Cell* data = grid [ remoteneighbor] ; 

51 data->resize( ) ; 

52 } 

53 

54 // update particle data between processes 

55 Cell: :transfer particles = true; 

56 grid . update_remote_neighbor_data( ) ; 

57 } 

58 } 
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Figure 5: Relevant excerpts from a simple kinetic simulation showing how to use dccrg when the 
amount of data in grid cells varies both in space and in time, see text for details. 
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Figure 6: Strong MHD scalability tests results with a static grid in one, two and three dimen- 
sions in Meteo (Cray XT5m). Total number of solved cells per second is shown as a function 
of the number of processes used and the total number of cells in each simulation. Ideal line is 
extrapolated from the 12 process result with 1 M cells. 
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Static grid blast wave profile 
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Figure 7: Profile of the three dimensional MHD blast wave test without AMR using 1 M cells. 
Alkeduce labels the only global communication executed each time step where the maximum 
length of the physical time step is obtained using MPI_Allreduce. Initialization and file I/O are 
not included. 



30 



Blast wave scalability in jugene 
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Figure 8: Strong MHD scalability tests results with a static grid in three dimensions in Jugene 
(IBM Blue Gene/P). Total number of solved cells per second is shown as a function of the number 
of processes used and the total number of cells in each simulation. Ideal line is extrapolated from 
the 128 process result with 1 M cells. 
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Blast wave scalability in curie 
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Figure 9: Strong MHD scalability tests results with a static grid in three dimensions in Curie 
(bullx InfiniBand). Total number of solved cells per second is shown as a function of the number 
of processes used and the total number of cells in each simulation. Ideal line is extrapolated from 
the 32 process result with 64 M cells. 
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speed of pure adaptive mesh refinement 
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Figure 10: Speed of adaptive mesh refinement in dccrg. Initial size of the grid is 8\ 16^643 or 
128^ cells. Every process refines each local cell until the total size of the grid is 128^ or 256^ cells. 
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Figure 11: Adaptive mesh refinement used in a MHD blast wave test (from |30|) showing from 
left to right, top to bottom the grid, pressure, magnetic and kinetic energy density during the 
simulation (final state of simulation in the print version) in the y = plane when grid is adapted at 
every time step. At the end of the simulation the fraction of maximum to minimum values are 15 
for density (not shown), 43 for pressure and 2.3 for magnetic energy density. 
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Blast wave AMR scalability in meteo 
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Figure 12: Scalability of adaptive mesh refinement used in a MHD blast wave test (f30l). In 
the reference run a CFL of 0.4 is used, AMR is done at every time step and the load is balanced 
whenever the local cell fraction fc exceeded 2, where fc - Nmax/N,„i„ and max and min are the 
maximum and minimum number of local cells among processes respectively. The AMRai runs are 
otherwise identical to the reference run but CFL is set to 0.4/N and AMR is done every Nth time 
step, essentially multiplying the amount of non-AMR work in these runs by N. 
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Run-time AMR blast wave profile 
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Figure 13: Profile of the three dimensional MHD blast wave test with AMR using at most 10 M 
cells. Allreduce labels the calculation of the maximum length of the physical time step obtained 
using MPI_Allreduce. Initiahzation and file I/O are not included. 
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